Metrics with Prometheus
What You Will Learn
- The Prometheus data model and why it differs from traditional monitoring systems
- All four Prometheus metric types with production use cases and common mistakes
- How to expose metrics from a FastAPI service with auto-instrumentation
- How to write custom application metrics for a document processing service
- Ten production PromQL queries every SRE should know
- How to write Prometheus alerting rules, route them through Alertmanager, and build a Grafana dashboard from JSON
Prerequisites
| Requirement | Details |
|---|---|
| Python 3.11+ | Type hints used throughout |
| FastAPI basics | All examples instrument FastAPI |
| prometheus-client, prometheus-fastapi-instrumentator | pip install prometheus-client prometheus-fastapi-instrumentator |
| Docker + docker-compose | Prometheus, Alertmanager, Grafana run in containers |
| Lesson 01 complete | Logging context assumed |
The Incident: 3 AM, p99 > 2 Seconds
PagerDuty fires at 03:17. The alert: p99 latency > 2s on document-api. Your on-call rotation just woke you up.
You open the logs. They look like this:
INFO POST /api/documents 1891ms
INFO POST /api/documents 2103ms
INFO POST /api/documents 847ms
INFO POST /api/documents 3201ms
Slow, but no errors. You have no metrics. Without metrics, your investigation looks like this:
- 03:17 - Wake up, look at logs
- 03:22 - Try to reproduce locally (cannot, it's a production load pattern)
- 03:31 - Start adding time.time() instrumentation to guess where the slowness is
- 03:48 - Realise you need to deploy the change and wait for production traffic
- 04:02 - Still investigating; users are complaining
Now imagine you have a Prometheus histogram tracking latency by route, with a bounded document_size_bucket label (for example small, medium, large) derived from the size of each document. A single PromQL query at 03:17 shows:
histogram_quantile(0.99,
sum by (le, http_route, document_size_bucket) (
rate(http_request_duration_seconds_bucket[5m])
)
)
The result: p99 is high only on POST /api/documents when document_size_bucket is "large". Large documents are slow. The entire root cause analysis takes 90 seconds.
This lesson is about having those metrics in place before you need them.
1. The Prometheus Data Model
Prometheus stores time series - sequences of (timestamp, float64) pairs, identified by a metric name and a set of labels.
http_request_duration_seconds_bucket{
method="POST",
route="/api/documents",
status_code="200",
le="0.5"
} = 1847
- Metric name: http_request_duration_seconds_bucket
- Labels: key-value pairs that give the metric its dimensions
- Value: a float64 (here, the count of requests with duration ≤ 0.5s)
- Timestamp: added by Prometheus at scrape time
Cardinality: The Critical Constraint
Every unique combination of label values creates a new time series. This is cardinality. High cardinality destroys Prometheus performance.
# CORRECT: Low cardinality labels
request_counter.labels(
method="POST",
route="/api/documents", # Fixed set of routes
status_code="200", # 5xx, 4xx, 2xx or exact codes
).inc()
# WRONG: High cardinality - explodes Prometheus
request_counter.labels(
user_id="usr_4492", # Millions of unique user IDs!
document_id="doc_8f3a", # Billions of documents!
request_id="req_7e9d", # Unique per request!
).inc()
Rule: Labels should have bounded cardinality. Anything with more than ~100 unique values is suspicious. Anything per-user or per-request belongs in logs, not metrics.
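To make the rule concrete: the number of series one metric can produce is bounded by the product of its label cardinalities, multiplied by the number of series exposed per label set. The sketch below is a rough estimator; the function name and the label counts are illustrative assumptions, not measurements.

```python
from math import prod


def estimate_series(label_cardinalities: dict[str, int], series_per_label_set: int = 1) -> int:
    """Upper bound on the time series one metric can create.

    series_per_label_set is 1 for a counter or gauge. For a histogram it is
    the number of buckets (including +Inf) plus 2 for _sum and _count.
    """
    return prod(label_cardinalities.values()) * series_per_label_set


# Bounded labels: 5 methods x 20 routes x 5 status classes
low = estimate_series({"method": 5, "route": 20, "status_class": 5})

# The same metric with a user_id label added (1 million users, illustrative)
high = estimate_series(
    {"method": 5, "route": 20, "status_class": 5, "user_id": 1_000_000}
)

print(low)   # 500
print(high)  # 500000000
```

This is a worst-case bound: only label combinations that are actually observed become series, but an unbounded label such as user_id pushes the real number towards it.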
The Prometheus Scrape Model
Prometheus pulls metrics by scraping HTTP endpoints at regular intervals (typically 15s). Your Python service exposes a /metrics endpoint; Prometheus polls it.
┌─────────────────┐ scrape every 15s ┌─────────────────┐
│ Python Service │ ─────────────────────► │ Prometheus │
│ :8001/metrics │ │ (stores TSDB) │
└─────────────────┘ └────────┬────────┘
│ query
▼
┌─────────────────┐
│ Grafana │
│ Dashboards │
└─────────────────┘
2. Counter
A counter is a monotonically increasing value. It only goes up (and resets to zero when the process restarts). Use it for events that accumulate: requests, errors, bytes processed, messages consumed.
from prometheus_client import Counter
# Define at module level - Prometheus registers metrics globally
http_requests_total = Counter(
"http_requests_total", # Metric name
"Total HTTP requests received", # Help text (shown in /metrics)
["method", "route", "status_code"], # Label names
)
errors_total = Counter(
"app_errors_total",
"Total application errors by type",
["error_type", "component"],
)
documents_processed_total = Counter(
"documents_processed_total",
"Total documents processed",
["content_type", "status"],
)
Using Counters
# Increment by 1 (most common)
http_requests_total.labels(
method="POST",
route="/api/documents",
status_code="200",
).inc()
# Increment by N
documents_processed_total.labels(
content_type="application/pdf",
status="success",
).inc(batch_size)
# Track exceptions
try:
    result = process_document(doc)
except ValidationError:
    errors_total.labels(
        error_type="ValidationError",
        component="document_processor",
    ).inc()
    raise
PromQL for Counters
# Requests per second over the last 5 minutes
rate(http_requests_total[5m])
# Error rate (errors per second)
rate(app_errors_total[5m])
# Error percentage
(
rate(http_requests_total{status_code=~"5.."}[5m])
/
rate(http_requests_total[5m])
) * 100
# Total requests in the last hour (handles counter resets)
increase(http_requests_total[1h])
# Top 5 error types
topk(5, sum by (error_type) (rate(app_errors_total[5m])))
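The reset handling behind rate() and increase() can be sketched in a few lines of Python. This is a simplified model under stated assumptions: any drop between consecutive samples is treated as a restart from zero, and the extrapolation to the window boundaries that Prometheus also performs is left out. The function names are hypothetical.

```python
def increase_with_resets(samples: list[float]) -> float:
    """Total increase across consecutive samples, treating any drop as a counter reset."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The counter restarted from zero, so the whole current value is new
            total += curr
    return total


def simple_rate(samples: list[float], window_seconds: float) -> float:
    """Per-second average rate over the window (no boundary extrapolation)."""
    return increase_with_resets(samples) / window_seconds


# Scraped every 15s; the process restarts between the third and fourth sample
scrapes = [100.0, 130.0, 160.0, 20.0, 50.0]

print(increase_with_resets(scrapes))  # 110.0
print(simple_rate(scrapes, 60.0))
```

Without the reset branch, the drop from 160 to 20 would be read as a negative increase, which is why raw counter values should not be subtracted by hand.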
Counter Pitfalls
# WRONG: Using a counter for something that can go down
active_connections = Counter("active_connections", "...") # Wrong!
# This cannot go down - use a Gauge
# WRONG: Calling .inc() with a float > 1 when you mean events
requests.inc(response_time) # Wrong - this is not counting requests
# Use a Histogram for response time
# CORRECT: Counting bytes (a counter, because bytes never decrease)
bytes_sent = Counter("bytes_sent_total", "Total bytes sent")
bytes_sent.inc(len(response_body))
3. Gauge
A gauge is a value that can go up and down. Use it for current state: active connections, queue depth, memory usage, number of items in a cache, temperature.
from prometheus_client import Gauge
# Database connection pool
db_connections_active = Gauge(
"db_connections_active",
"Number of currently active database connections",
["pool_name"],
)
db_connections_idle = Gauge(
"db_connections_idle",
"Number of idle database connections in the pool",
["pool_name"],
)
# Message queue
queue_depth = Gauge(
"document_queue_depth",
"Number of documents waiting to be processed",
["queue_name", "priority"],
)
# Memory (custom, in addition to process_* metrics Prometheus provides)
model_memory_bytes = Gauge(
"ml_model_memory_bytes",
"Memory used by loaded ML models",
["model_name", "version"],
)
Using Gauges
# Set to a specific value
db_connections_active.labels(pool_name="primary").set(pool.checked_out)
db_connections_idle.labels(pool_name="primary").set(pool.idle_count)
# Increment and decrement
queue_depth.labels(queue_name="document_processing", priority="high").inc()
# ... after processing:
queue_depth.labels(queue_name="document_processing", priority="high").dec()
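# Use as a context manager - automatically inc on enter, dec on exit
with queue_depth.labels(queue_name="document_processing", priority="high").track_inprogress():
    process_document(doc)
# Record how long the most recent call took - Gauge.time() sets the gauge to the duration in seconds
model_load_seconds = Gauge(
    "ml_model_load_seconds",  # Illustrative metric name
    "Duration of the most recent model load",
    ["model_name", "version"],
)
@model_load_seconds.labels(model_name="classifier", version="1.0").time()
def load_model():
    ...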
# Set to current Unix timestamp - useful for "last successful run" gauges
last_backup_timestamp = Gauge(
"last_backup_timestamp_seconds",
"Unix timestamp of the last successful database backup",
)
last_backup_timestamp.set_to_current_time()
PromQL for Gauges
# Current queue depth
document_queue_depth{queue_name="document_processing"}
# Connection pool utilisation percentage
(
db_connections_active{pool_name="primary"}
/
(db_connections_active{pool_name="primary"} + db_connections_idle{pool_name="primary"})
) * 100
# Time since last backup (seconds)
time() - last_backup_timestamp_seconds
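# Queue depth above 100 for the whole of the last 5 minutes
# (in an alerting rule, write document_queue_depth > 100 together with for: 5m)
min_over_time(document_queue_depth[5m]) > 100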
4. Histogram
A histogram is the most powerful and most misunderstood Prometheus metric type. It tracks the distribution of observed values (like request durations) across configurable buckets.
How Histograms Work
For each observation (e.g., a request that took 347ms), the client library updates three kinds of series, which Prometheus then scrapes:
- All _bucket counters whose le ("less than or equal") upper bound is >= the observed value are incremented
- The _count counter (total observations) is incremented
- The _sum counter is increased by the observed value
http_request_duration_seconds_bucket{le="0.1"} = 892 # requests <= 100ms
http_request_duration_seconds_bucket{le="0.25"} = 1841 # requests <= 250ms
http_request_duration_seconds_bucket{le="0.5"} = 2103 # requests <= 500ms
http_request_duration_seconds_bucket{le="1.0"} = 2144 # requests <= 1000ms
http_request_duration_seconds_bucket{le="2.5"} = 2147 # requests <= 2500ms
http_request_duration_seconds_bucket{le="+Inf"} = 2147 # all requests
http_request_duration_seconds_count = 2147
http_request_duration_seconds_sum = 892.4
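The cumulative bucket layout shown above can be reproduced with a small stand-in class. This is an illustrative sketch, not the prometheus_client implementation; the class name and the sample durations are assumptions chosen to make the counts easy to follow.

```python
from bisect import bisect_left
from math import inf


class SketchHistogram:
    """Illustrative stand-in for a client-side histogram."""

    def __init__(self, upper_bounds: list[float]) -> None:
        self.upper_bounds = sorted(upper_bounds) + [inf]
        self.per_bucket = [0] * len(self.upper_bounds)  # non-cumulative counts
        self.count = 0
        self.total = 0.0

    def observe(self, value: float) -> None:
        # Index of the first bucket whose upper bound is >= value
        self.per_bucket[bisect_left(self.upper_bounds, value)] += 1
        self.count += 1
        self.total += value

    def cumulative(self) -> dict[float, int]:
        """Cumulative counts per upper bound, as exposed in the _bucket series."""
        running = 0
        out: dict[float, int] = {}
        for bound, n in zip(self.upper_bounds, self.per_bucket):
            running += n
            out[bound] = running
        return out


h = SketchHistogram([0.1, 0.25, 0.5, 1.0, 2.5])
for duration in (0.05, 0.347, 0.347, 0.9, 3.0):
    h.observe(duration)

print(h.cumulative())
# {0.1: 1, 0.25: 1, 0.5: 3, 1.0: 4, 2.5: 4, inf: 5}
print(h.count)  # 5
```

The 347ms observations appear in the le="0.5" bucket and in every larger bucket, and the +Inf bucket always equals _count.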
Defining Histograms with the Right Buckets
Bucket selection is critical. If your SLO is "p99 < 500ms", you need buckets around 500ms to calculate it accurately.
from prometheus_client import Histogram
# Request latency - buckets in seconds, chosen for a web API SLO of p99 < 1s
http_request_duration_seconds = Histogram(
"http_request_duration_seconds",
"HTTP request duration in seconds",
["method", "route", "status_class"],
# Dense around the SLO target, sparse elsewhere
buckets=[
0.005, # 5ms
0.01, # 10ms
0.025, # 25ms
0.05, # 50ms
0.1, # 100ms
0.25, # 250ms
0.5, # 500ms
0.75, # 750ms
1.0, # 1s - SLO boundary
2.5, # 2.5s
5.0, # 5s
10.0, # 10s
float("inf"), # Optional - prometheus_client appends +Inf when it is missing
],
)
# Document processing time - buckets in seconds, much wider range
document_processing_seconds = Histogram(
"document_processing_seconds",
"Time to fully process a document",
["content_type", "page_count_bucket"],
buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0, float("inf")],
)
# Model inference latency - tight buckets for fast ML inference
model_inference_seconds = Histogram(
"model_inference_seconds",
"Time for ML model inference",
["model_name", "batch_size_bucket"],
buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, float("inf")],
)
Using Histograms
import time
from contextlib import contextmanager
# Method 1: Manual timing
start = time.perf_counter()
result = process_document(doc)
duration = time.perf_counter() - start
document_processing_seconds.labels(
content_type="application/pdf",
page_count_bucket="10-50",
).observe(duration)
# Method 2: Context manager (cleaner)
with document_processing_seconds.labels(
    content_type="application/pdf",
    page_count_bucket="10-50",
).time():
    result = process_document(doc)
# Method 3: Decorator
@http_request_duration_seconds.labels(
    method="POST",
    route="/api/documents",
    status_class="2xx",
).time()
def handle_upload(request):
    ...
PromQL for Histograms
# p50, p95, p99 latency over last 5 minutes
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# p99 broken down by route - find the slow endpoint
histogram_quantile(0.99,
sum by (le, route) (
rate(http_request_duration_seconds_bucket[5m])
)
)
# Average request duration (sum / count)
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
# Percentage of requests completing within 500ms
(
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
) * 100
Choosing Bucket Boundaries
| Scenario | Recommended Buckets (seconds) |
|---|---|
| Real-time API (SLO: p99 < 100ms) | 0.001, 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 1.0, +Inf |
| Standard API (SLO: p99 < 1s) | 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, +Inf |
| Background jobs (SLO: p99 < 30s) | 0.1, 0.5, 1.0, 5.0, 10.0, 15.0, 30.0, 60.0, 120.0, +Inf |
| Batch processing (SLO: p99 < 5min) | 1.0, 5.0, 15.0, 30.0, 60.0, 120.0, 300.0, 600.0, +Inf |
5. Summary vs Histogram: When to Use Each
| Aspect | Histogram | Summary |
|---|---|---|
| Quantile calculation | Server-side in Prometheus (PromQL) | Client-side in the application |
| Aggregation across instances | Yes - sum by (le) across replicas | No - quantiles cannot be summed |
| Memory cost | Fixed (number of buckets × label combinations) | Grows with sliding window size |
| Accuracy | Approximate (depends on bucket boundaries) | Configurable precision |
| Best for | Most use cases; SLO alerting | Single-instance services; precise quantiles needed locally |
Recommendation: Use Histogram for almost everything. Use Summary only if you have a single-instance service and need exact quantiles locally, and you will never need to aggregate across multiple replicas.
from prometheus_client import Summary
# Summary example - avoid in multi-instance deployments
request_latency_summary = Summary(
"request_latency_seconds",
"Request latency",
["route"],
)
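# The Python client's Summary exposes only _count and _sum; it does not compute quantiles.
# Client libraries that do support Summary quantiles (for example Go and Java) calculate
# them IN the application process, over a sliding window.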
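Why pre-computed quantiles cannot be combined across replicas is easiest to see with numbers. The sketch below uses a nearest-rank percentile over raw values purely for illustration (neither Prometheus nor the client libraries compute quantiles this way), and the two replicas and their traffic are made up.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 1]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Replica A serves 1000 fast requests; replica B serves 10 slow ones
replica_a = [0.1] * 1000
replica_b = [3.0] * 10

p99_a = percentile(replica_a, 0.99)
p99_b = percentile(replica_b, 0.99)

averaged = (p99_a + p99_b) / 2                   # what averaging two Summary quantiles gives
merged = percentile(replica_a + replica_b, 0.99)  # the p99 of the combined traffic

print(p99_a, p99_b)  # 0.1 3.0
print(averaged)
print(merged)        # 0.1
```

Averaging the per-replica values reports roughly 1.55s, while fewer than 1% of all requests were slow. Histogram buckets avoid the problem because bucket counts from different replicas can simply be added before the quantile is estimated.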
6. FastAPI Auto-Instrumentation
prometheus-fastapi-instrumentator automatically instruments every route with request count, latency histograms, and in-progress request gauges.
# pip install prometheus-fastapi-instrumentator
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator, metrics
app = FastAPI()
# Create instrumentator with production settings
instrumentator = Instrumentator(
should_group_status_codes=True, # Report 2xx/4xx/5xx instead of exact codes
should_ignore_untemplated=True, # Skip requests that match no registered route
should_respect_env_var=True, # Only instrument when ENABLE_METRICS=true is set
should_instrument_requests_inprogress=True, # Required for the in-progress gauge
# Exclude internal endpoints from metrics
excluded_handlers=[
"/metrics", # Don't track the metrics endpoint itself
"/health", # Don't track health checks
"/liveness",
"/readiness",
"/docs",
"/openapi.json",
],
inprogress_name="http_requests_inprogress",
inprogress_labels=True,
)
# Add default metrics (latency histogram, request count, in-progress)
instrumentator.add(
metrics.request_size(
metric_name="http_request_size_bytes",
metric_doc="HTTP request body size in bytes",
)
).add(
metrics.response_size(
metric_name="http_response_size_bytes",
metric_doc="HTTP response body size in bytes",
)
).add(
metrics.latency(
metric_name="http_request_duration_seconds",
metric_doc="HTTP request latency in seconds",
buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)
).add(
metrics.requests(
metric_name="http_requests_total",
metric_doc="Total HTTP requests",
)
)
# Mount /metrics endpoint (must be before app.include_router calls)
instrumentator.instrument(app).expose(
app,
endpoint="/metrics",
include_in_schema=False,
tags=["monitoring"],
)
What the /metrics Endpoint Produces
After instrumentation, GET /metrics returns:
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{handler="/api/documents",method="POST",status="2xx"} 2147.0
http_requests_total{handler="/api/documents",method="POST",status="4xx"} 13.0
http_requests_total{handler="/api/documents",method="POST",status="5xx"} 2.0
# HELP http_request_duration_seconds HTTP request latency in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{handler="/api/documents",le="0.005"} 0.0
http_request_duration_seconds_bucket{handler="/api/documents",le="0.01"} 12.0
http_request_duration_seconds_bucket{handler="/api/documents",le="0.025"} 89.0
...
http_request_duration_seconds_count{handler="/api/documents"} 2162.0
http_request_duration_seconds_sum{handler="/api/documents"} 1087.4
# HELP http_requests_inprogress HTTP requests currently in progress
# TYPE http_requests_inprogress gauge
http_requests_inprogress{handler="/api/documents",method="POST"} 3.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 4.26844160e+08
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 8731.0
python_gc_objects_collected_total{generation="1"} 492.0
python_gc_objects_collected_total{generation="2"} 12.0
7. Custom Application Metrics
Auto-instrumentation covers HTTP-level metrics. You also need application-level metrics that reflect your business logic.
Complete Metrics Module for a Document Processing Service
# app/metrics.py
"""
Application-level Prometheus metrics.
Import this module once at startup. Metric objects are singletons
registered in the global Prometheus registry.
"""
from prometheus_client import Counter, Gauge, Histogram, Info
# ─── Document Processing ────────────────────────────────────────────────────
documents_received_total = Counter(
"documents_received_total",
"Total documents received for processing",
["content_type", "source"],
)
documents_processed_total = Counter(
"documents_processed_total",
"Total documents that completed processing",
["content_type", "status"], # status: success | validation_error | processing_error
)
document_processing_duration_seconds = Histogram(
"document_processing_duration_seconds",
"End-to-end time to process a document",
["content_type", "page_count_bucket"],
buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, float("inf")],
)
document_size_bytes = Histogram(
"document_size_bytes",
"Size of received documents in bytes",
["content_type"],
buckets=[
1_024, # 1 KB
10_240, # 10 KB
102_400, # 100 KB
1_048_576, # 1 MB
10_485_760, # 10 MB
52_428_800, # 50 MB
float("inf"),
],
)
documents_in_queue = Gauge(
"documents_in_queue",
"Number of documents currently waiting in the processing queue",
["priority"],
)
# ─── ML Model Metrics ───────────────────────────────────────────────────────
model_inference_duration_seconds = Histogram(
"model_inference_duration_seconds",
"Time for model inference",
["model_name", "model_version"],
buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, float("inf")],
)
model_inference_requests_total = Counter(
"model_inference_requests_total",
"Total model inference requests",
["model_name", "model_version", "status"],
)
# ─── Cache Metrics ──────────────────────────────────────────────────────────
cache_operations_total = Counter(
"cache_operations_total",
"Total cache operations",
["cache_name", "operation", "result"], # result: hit | miss | error
)
cache_items_current = Gauge(
"cache_items_current",
"Current number of items in the cache",
["cache_name"],
)
# ─── Database Metrics ───────────────────────────────────────────────────────
db_query_duration_seconds = Histogram(
"db_query_duration_seconds",
"Database query execution time",
["operation", "table"], # operation: select | insert | update | delete
buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, float("inf")],
)
db_connections_active = Gauge(
"db_connections_active",
"Active database connections",
["pool_name"],
)
db_connections_idle = Gauge(
"db_connections_idle",
"Idle database connections",
["pool_name"],
)
db_connection_pool_size = Gauge(
"db_connection_pool_size",
"Total database connection pool size",
["pool_name"],
)
# ─── External API Metrics ───────────────────────────────────────────────────
external_api_duration_seconds = Histogram(
"external_api_duration_seconds",
"Time to complete external API calls",
["service", "operation", "status_class"],
buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, float("inf")],
)
external_api_errors_total = Counter(
"external_api_errors_total",
"Total external API errors",
["service", "operation", "error_type"],
)
# ─── Service Info ───────────────────────────────────────────────────────────
service_info = Info(
"service",
"Service metadata",
)
def initialise_service_info(name: str, version: str, environment: str) -> None:
    """Call once at startup to set service metadata in metrics."""
    service_info.info({
        "name": name,
        "version": version,
        "environment": environment,
    })
Using Metrics in Application Code
# app/services/document_processor.py
import time
from app import metrics
# Document and ValidationError are assumed to come from the application's own modules.

class DocumentProcessor:
    async def process(self, content: bytes, content_type: str, priority: str) -> Document:
        # Track document receipt
        metrics.documents_received_total.labels(
            content_type=content_type,
            source="api",
        ).inc()
        metrics.document_size_bytes.labels(
            content_type=content_type,
        ).observe(len(content))
        metrics.documents_in_queue.labels(priority=priority).inc()
        start = time.perf_counter()
        # Defaults keep the finally block safe if the task is cancelled
        # (asyncio.CancelledError is not an Exception subclass)
        doc = None
        status = "processing_error"
        try:
            doc = await self._do_process(content, content_type)
            status = "success"
            return doc
        except ValidationError:
            status = "validation_error"
            raise
        except Exception:
            status = "processing_error"
            raise
        finally:
            duration = time.perf_counter() - start
            page_count = getattr(doc, "page_count", 0) if status == "success" else 0
            metrics.documents_processed_total.labels(
                content_type=content_type,
                status=status,
            ).inc()
            metrics.document_processing_duration_seconds.labels(
                content_type=content_type,
                page_count_bucket=_bucket_page_count(page_count),
            ).observe(duration)
            metrics.documents_in_queue.labels(priority=priority).dec()

    async def _do_model_inference(self, text: str, model_name: str) -> dict:
        start = time.perf_counter()
        try:
            result = await self.model.predict(text)
            metrics.model_inference_requests_total.labels(
                model_name=model_name,
                model_version="1.0",
                status="success",
            ).inc()
            return result
        except Exception:
            metrics.model_inference_requests_total.labels(
                model_name=model_name,
                model_version="1.0",
                status="error",
            ).inc()
            raise
        finally:
            duration = time.perf_counter() - start
            metrics.model_inference_duration_seconds.labels(
                model_name=model_name,
                model_version="1.0",
            ).observe(duration)

def _bucket_page_count(pages: int) -> str:
    """Map a page count to a bounded set of label values."""
    if pages == 0:
        return "unknown"
    if pages <= 5:
        return "1-5"
    if pages <= 20:
        return "6-20"
    if pages <= 100:
        return "21-100"
    return "101+"
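The same bucketing idea yields the document_size_bucket label used in the incident query at the start of this lesson. A minimal sketch follows; the function name and the 1 MB / 10 MB thresholds are assumptions for illustration and should be tuned to your own traffic.

```python
def bucket_document_size(size_bytes: int) -> str:
    """Map a byte count to one of three label values (thresholds are illustrative)."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= 1_048_576:       # up to 1 MB
        return "small"
    if size_bytes <= 10_485_760:      # up to 10 MB
        return "medium"
    return "large"


print(bucket_document_size(250_000))     # small
print(bucket_document_size(5_000_000))   # medium
print(bucket_document_size(40_000_000))  # large
```

Three fixed values keep the label's cardinality bounded, whereas using the raw byte count as a label would create a new series for almost every document.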
8. PromQL Essentials
Ten queries every SRE working with Python services needs to know:
# 1. Request rate per second, last 5 minutes, by route
sum by (handler) (
rate(http_requests_total[5m])
)
# 2. Error rate as a percentage of total requests
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) * 100
# 3. p99 latency by route (swap 0.99 for 0.50 or 0.95 to get p50 / p95)
histogram_quantile(0.99,
sum by (le, handler) (
rate(http_request_duration_seconds_bucket[5m])
)
)
# 4. Apdex score (target = 500ms, tolerable = 2.5s)
# Thresholds must match existing bucket boundaries: the latency histogram above has le="2.5" but no le="2.0"
# Apdex = (satisfied + 0.5 * tolerating) / total
(
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
+
0.5 * (
sum(rate(http_request_duration_seconds_bucket{le="2.5"}[5m]))
-
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
)
) / sum(rate(http_request_duration_seconds_count[5m]))
# 5. Database connection pool saturation
(
db_connections_active{pool_name="primary"}
/
db_connection_pool_size{pool_name="primary"}
) * 100
# 6. Cache hit rate
(
sum(rate(cache_operations_total{result="hit"}[5m]))
/
sum(rate(cache_operations_total{result=~"hit|miss"}[5m]))
) * 100
# 7. Document processing throughput (docs/sec)
sum(rate(documents_processed_total{status="success"}[5m]))
# 8. Average document processing time
rate(document_processing_duration_seconds_sum[5m])
/
rate(document_processing_duration_seconds_count[5m])
# 9. External API error rate by service
sum by (service) (
rate(external_api_errors_total[5m])
)
# 10. Python GC collection rate per generation (from the client library's default GC collector; this counts collections, not pause time)
rate(python_gc_collections_total[5m])
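The Apdex arithmetic in query 4 can be checked by hand. The sketch below plugs in the cumulative bucket counts from the exposition example in section 4 (le="0.5" is 2103, le="2.5" is 2147, total 2147); that example has no 2s bucket, so the 2.5s bucket stands in for the tolerable threshold. The function name is hypothetical.

```python
def apdex(satisfied: float, within_tolerable: float, total: float) -> float:
    """Apdex = (satisfied + 0.5 * tolerating) / total.

    satisfied:        cumulative count at the target threshold
    within_tolerable: cumulative count at the tolerable threshold
    total:            all observations (_count)
    """
    if total <= 0:
        raise ValueError("total must be positive")
    # Buckets are cumulative, so subtract to get the requests between the two thresholds
    tolerating = within_tolerable - satisfied
    return (satisfied + 0.5 * tolerating) / total


score = apdex(satisfied=2103, within_tolerable=2147, total=2147)
print(round(score, 4))  # 0.9898
```

The subtraction matters: adding half of the tolerable bucket without removing the satisfied requests would count them one and a half times.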
9. Alerting Rules
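Alerting rules are evaluated by Prometheus at a configurable interval. When a rule's expression returns one or more time series, each becomes a pending alert. Once the expression has kept returning that series for the rule's for: duration, the alert fires and Prometheus sends it to Alertmanager, which handles grouping, routing, and notification.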
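The lifecycle that a rule's for: clause controls (inactive, pending, firing) can be sketched as a small state machine. This is a simplified model with hypothetical names; it assumes one evaluation per interval and ignores details such as missed evaluations.

```python
def alert_states(condition_true: list[bool], eval_interval_s: int, for_s: int) -> list[str]:
    """Return the alert state after each rule evaluation."""
    states: list[str] = []
    active_since: int | None = None  # time at which the condition first became true
    for i, is_true in enumerate(condition_true):
        now = i * eval_interval_s
        if not is_true:
            # Any evaluation where the expression returns nothing resets the timer
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Evaluated every 30s with for: 60s
timeline = alert_states(
    [False, True, True, True, True, False, True],
    eval_interval_s=30,
    for_s=60,
)
print(timeline)
# ['inactive', 'pending', 'pending', 'firing', 'firing', 'inactive', 'pending']
```

A single evaluation where the condition is false resets the timer, which is why a flapping condition may never fire with a long for: duration.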
docker-compose Setup
# docker-compose.yml additions (under the top-level services: key)
# The prometheus_data named volume must also be declared under the top-level volumes: key
  prometheus:
    image: prom/prometheus:v2.50.1
    ports:
      - "9090:9090"
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./config/alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=15d"
  alertmanager:
    image: prom/alertmanager:v0.26.0
    ports:
      - "9093:9093"
    volumes:
      - ./config/alertmanager.yml:/etc/alertmanager/alertmanager.yml
# config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
rule_files:
  - "alerts.yml"
scrape_configs:
  - job_name: "document-api"
    static_configs:
      - targets: ["app:8001"]
    metrics_path: "/metrics"
Five Production Alerting Rules
# config/alerts.yml
groups:
  - name: document_api_alerts
    interval: 30s
    rules:
      # 1. High Error Rate (grouped by job so that $labels.job is populated)
      - alert: HighErrorRate
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_requests_total[5m]))
          ) * 100 > 1
        for: 2m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High HTTP error rate on {{ $labels.job }}"
          description: >
            Error rate is {{ $value | printf "%.2f" }}%
            (threshold: 1%) for the last 2 minutes.
          runbook: "https://wiki.example.com/runbooks/high-error-rate"
          dashboard: "https://grafana.example.com/d/abc123"
      # 2. High p99 Latency
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le, handler) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          ) > 2.0
        for: 3m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "p99 latency > 2s on {{ $labels.handler }}"
          description: >
            p99 latency is {{ $value | printf "%.3f" }}s
            on route {{ $labels.handler }}.
            Alert threshold is 2.0s; SLO target is 1.0s.
      # 3. Low Cache Hit Rate
      - alert: LowCacheHitRate
        expr: |
          (
            sum by (cache_name) (rate(cache_operations_total{result="hit"}[10m]))
            /
            sum by (cache_name) (rate(cache_operations_total{result=~"hit|miss"}[10m]))
          ) * 100 < 60
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "Cache hit rate below 60% on {{ $labels.cache_name }}"
          description: >
            Cache hit rate is {{ $value | printf "%.1f" }}%
            on {{ $labels.cache_name }}.
            This may indicate cache invalidation issues or increased unique traffic.
      # 4. Database Connection Pool Near Exhaustion
      - alert: DatabaseConnectionPoolSaturation
        expr: |
          (
            db_connections_active{pool_name="primary"}
            /
            db_connection_pool_size{pool_name="primary"}
          ) * 100 > 80
        for: 1m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "DB connection pool > 80% utilised"
          description: >
            {{ $value | printf "%.1f" }}% of the primary database connection
            pool is in use. At 100%, new requests will queue or fail.
            Current active: {{ with query "db_connections_active{pool_name='primary'}" }}{{ . | first | value | printf "%.0f" }}{{ end }}
      # 5. Service Down (scrapes have failed for 2 minutes)
      - alert: ServiceDown
        expr: |
          up{job="document-api"} == 0
        for: 2m
        labels:
          severity: critical
          team: backend
          page: "true"
        annotations:
          summary: "document-api is unreachable"
          description: >
            Prometheus cannot scrape {{ $labels.instance }}.
            The service may be crashed or the /metrics endpoint is broken.
Alertmanager Configuration
# config/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
route:
  group_by: ["alertname", "job", "severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "slack-critical"
  routes:
    - match:
        severity: critical
        page: "true"
      receiver: "pagerduty"
    - match:
        severity: warning
      receiver: "slack-warnings"
receivers:
  - name: "slack-critical"
    slack_configs:
      - channel: "#incidents"
        send_resolved: true
        title: "{{ .CommonAnnotations.summary }}"
        text: "{{ .CommonAnnotations.description }}"
  - name: "slack-warnings"
    slack_configs:
      - channel: "#alerts"
        send_resolved: true
  - name: "pagerduty"
    pagerduty_configs:
      - routing_key: "YOUR_PAGERDUTY_INTEGRATION_KEY"
        description: "{{ .CommonAnnotations.summary }}"
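To see which receiver each alert from this lesson reaches, the routing tree above can be modelled in a few lines. This is a simplified sketch under stated assumptions: child routes are tried in order, the first match wins, and an alert that matches no child falls back to the parent's receiver. Options such as continue and regex matchers are not modelled, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict[str, str] = field(default_factory=dict)
    routes: list["Route"] = field(default_factory=list)


def pick_receiver(route: Route, labels: dict[str, str]) -> str:
    """Return the receiver of the deepest first-matching route."""
    for child in route.routes:
        if all(labels.get(key) == value for key, value in child.match.items()):
            return pick_receiver(child, labels)
    return route.receiver


# Mirrors the route block in config/alertmanager.yml
root = Route(
    receiver="slack-critical",
    routes=[
        Route("pagerduty", match={"severity": "critical", "page": "true"}),
        Route("slack-warnings", match={"severity": "warning"}),
    ],
)

print(pick_receiver(root, {"alertname": "ServiceDown", "severity": "critical", "page": "true"}))
# pagerduty
print(pick_receiver(root, {"alertname": "HighErrorRate", "severity": "critical"}))
# slack-critical
print(pick_receiver(root, {"alertname": "HighP99Latency", "severity": "warning"}))
# slack-warnings
```

Note that HighErrorRate is critical but carries no page label, so under this configuration it goes to Slack rather than PagerDuty.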
10. Grafana Dashboard
A complete dashboard JSON for a document processing service. Import this via Grafana UI → Dashboards → Import → Paste JSON.
{
"title": "Document API - Service Dashboard",
"uid": "doc-api-v1",
"schemaVersion": 38,
"time": {"from": "now-1h", "to": "now"},
"refresh": "30s",
"panels": [
{
"id": 1,
"title": "Request Rate (req/s)",
"type": "timeseries",
"gridPos": {"x": 0, "y": 0, "w": 6, "h": 8},
"targets": [{
"expr": "sum(rate(http_requests_total[5m]))",
"legendFormat": "Total req/s"
}]
},
{
"id": 2,
"title": "Error Rate (%)",
"type": "timeseries",
"gridPos": {"x": 6, "y": 0, "w": 6, "h": 8},
"targets": [{
"expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
"legendFormat": "5xx Error Rate %"
}],
"fieldConfig": {
"defaults": {"thresholds": {"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 0.1},
{"color": "red", "value": 1}
]}}
}
},
{
"id": 3,
"title": "Latency Percentiles",
"type": "timeseries",
"gridPos": {"x": 12, "y": 0, "w": 12, "h": 8},
"targets": [
{
"expr": "histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
"legendFormat": "p50"
},
{
"expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
"legendFormat": "p95"
},
{
"expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
"legendFormat": "p99"
}
]
},
{
"id": 4,
"title": "In-Progress Requests",
"type": "stat",
"gridPos": {"x": 0, "y": 8, "w": 4, "h": 4},
"targets": [{
"expr": "http_requests_inprogress",
"legendFormat": "In Progress"
}]
},
{
"id": 5,
"title": "DB Pool Utilisation %",
"type": "gauge",
"gridPos": {"x": 4, "y": 8, "w": 4, "h": 4},
"targets": [{
"expr": "(db_connections_active{pool_name=\"primary\"} / db_connection_pool_size{pool_name=\"primary\"}) * 100",
"legendFormat": "Pool Util %"
}],
"fieldConfig": {
"defaults": {
"min": 0, "max": 100, "unit": "percent",
"thresholds": {"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 60},
{"color": "red", "value": 80}
]}
}
}
},
{
"id": 6,
"title": "Cache Hit Rate %",
"type": "gauge",
"gridPos": {"x": 8, "y": 8, "w": 4, "h": 4},
"targets": [{
"expr": "(sum(rate(cache_operations_total{result=\"hit\"}[5m])) / sum(rate(cache_operations_total{result=~\"hit|miss\"}[5m]))) * 100",
"legendFormat": "Hit Rate %"
}]
},
{
"id": 7,
"title": "Memory Usage (RSS)",
"type": "timeseries",
"gridPos": {"x": 0, "y": 12, "w": 8, "h": 8},
"targets": [{
"expr": "process_resident_memory_bytes",
"legendFormat": "RSS Memory"
}],
"fieldConfig": {"defaults": {"unit": "bytes"}}
},
{
"id": 8,
"title": "Python GC Collections/s",
"type": "timeseries",
"gridPos": {"x": 8, "y": 12, "w": 8, "h": 8},
"targets": [{
"expr": "sum by (generation) (rate(python_gc_collections_total[5m]))",
"legendFormat": "Gen {{generation}}"
}]
},
{
"id": 9,
"title": "Document Processing Rate",
"type": "timeseries",
"gridPos": {"x": 0, "y": 20, "w": 12, "h": 8},
"targets": [
{
"expr": "sum by (content_type) (rate(documents_processed_total{status=\"success\"}[5m]))",
"legendFormat": "{{content_type}} success/s"
},
{
"expr": "sum by (content_type) (rate(documents_processed_total{status!=\"success\"}[5m]))",
"legendFormat": "{{content_type}} error/s"
}
]
},
{
"id": 10,
"title": "Model Inference p99 (s)",
"type": "timeseries",
"gridPos": {"x": 12, "y": 20, "w": 12, "h": 8},
"targets": [{
"expr": "histogram_quantile(0.99, sum by (le, model_name) (rate(model_inference_duration_seconds_bucket[5m])))",
"legendFormat": "{{model_name}} p99"
}]
}
]
}
Prometheus + docker-compose: Complete Setup
# config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 30s
  external_labels:
    environment: "production"
    cluster: "primary"
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
rule_files:
  - "alerts.yml"
scrape_configs:
  - job_name: "document-api"
    metrics_path: "/metrics"
    scrape_timeout: 10s
    static_configs:
      - targets: ["app:8001"]
        labels:
          service: "document-api"
To verify your metrics are being scraped:
# Local development
curl http://localhost:8001/metrics | grep documents_processed
# PromQL via HTTP API (-G with --data-urlencode stops curl from mangling the braces and quotes)
curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="document-api"}'
Interview Questions and Answers
Q1: A Prometheus histogram has buckets at [0.1, 0.5, 1.0, 5.0, +Inf]. Your SLO is p99 < 800ms. Can you compute an accurate p99 from this histogram?
No. Prometheus computes histogram_quantile by linear interpolation within the bucket that contains the quantile. If the true p99 is 800ms, it falls in the bucket (0.5, 1.0]. Prometheus will linearly interpolate between 0.5s and 1.0s, giving an answer somewhere in that range, but it cannot tell you the exact value is 800ms. For an SLO target of 800ms, you need a bucket boundary at exactly 0.8 (or close to it) to get accurate alerting. Add 0.8 to your bucket list.
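The interpolation can be worked through in code. The function below is a simplified re-implementation for illustration only (it assumes sorted cumulative buckets ending in +Inf and a lower bound of 0 for the first bucket), and the request counts are invented: 1000 requests, of which 95 fall between 0.5s and 1.0s.

```python
from math import inf, nan


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs."""
    total = buckets[-1][1]
    if total == 0:
        return nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == inf:
                return prev_bound  # fall back to the highest finite bound
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


coarse = histogram_quantile(
    0.99, [(0.1, 500), (0.5, 900), (1.0, 995), (5.0, 1000), (inf, 1000)]
)

# Same traffic with a 0.8 boundary added; all 95 mid-range requests took <= 0.8s
fine = histogram_quantile(
    0.99, [(0.1, 500), (0.5, 900), (0.8, 995), (1.0, 995), (5.0, 1000), (inf, 1000)]
)

print(round(coarse, 3))  # 0.974
print(round(fine, 3))    # 0.784
```

With the coarse buckets the estimate lands at about 0.974s and an 800ms alert would fire, even though in this invented data every one of those requests finished within 800ms. Adding the 0.8 boundary brings the estimate back under the SLO.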
Q2: Your team wants to track which user ID is making the slowest requests by adding user_id as a Prometheus label. Why is this problematic, and what is the right approach?
Adding user_id as a label creates one time series per unique user ID. With 1 million users, you get 1 million time series just for that metric. Prometheus stores all active time series in RAM - this would consume gigabytes of memory and make Prometheus unusable. The right approach: keep user_id in logs (where cardinality is unlimited), and use a Prometheus histogram to track the distribution of slow requests. To find slow users, query logs with duration_ms > 2000 in Loki/Kibana. If you need per-user metrics for billing or SLAs, use a dedicated time series database like InfluxDB or a columnar store, not Prometheus.
Q3: What is the difference between rate() and increase() in PromQL, and when should you use each?
rate(counter[5m]) computes the per-second average rate of increase over the time window and handles counter resets (process restarts). increase(counter[5m]) is rate(counter[5m]) * 300 (the window in seconds): it gives the total increase over the window, with the same reset handling. Because both extrapolate to the window boundaries, increase() can return non-integer values even for an integer counter. Use rate() for alerting rules and graphs, since it gives a per-second value whose thresholds do not depend on the window size. Use increase() when you want to see "how many events happened in the last hour" in a human-readable form. Avoid increase() in alerting rules where you can: its value is not normalized to a rate, so the threshold has to change whenever the window does.
Q4: How does histogram_quantile work across multiple service replicas in Kubernetes?
Because histograms aggregate at the bucket level, you can sum buckets across replicas before calculating the quantile. The correct query is:
histogram_quantile(0.99,
sum by (le) (
rate(http_request_duration_seconds_bucket[5m])
)
)
The sum by (le) adds together the bucket counts from all replicas for each le value, then histogram_quantile computes the quantile from the combined distribution. This is why histograms are preferred over summaries for multi-instance deployments - Summary quantiles are computed per-instance and cannot be meaningfully summed.
Q5: Your Python service starts up and immediately registers dozens of Prometheus metrics. Another service team complains that their metrics are appearing in your /metrics endpoint. Why, and how do you fix it?
The prometheus_client library uses a global default registry (prometheus_client.REGISTRY). All metrics created with Counter(...), Gauge(...), etc. are added to this registry unless you pass another one. If your application imports a shared library that also registers metrics, or if multiple application modules are loaded in the same process, they all share the registry and all appear on the same /metrics endpoint. Two fixes: (1) Create a custom registry: registry = CollectorRegistry() and pass it to every metric: Counter("name", "help", registry=registry). Then expose it: generate_latest(registry). This is the cleanest solution for libraries. (2) Unregister unwanted collectors, for example REGISTRY.unregister(PROCESS_COLLECTOR). For application code, the global registry is usually fine; the problem typically indicates a dependency boundary issue that should be solved with custom registries.
